The purpose of this supplementary file is to reproduce data analysis of artefact compositions and isotopic signals of objects from the Duchcov hoard from the article Exceptional ritual or a production deposit? The Duchcov hoard (Bohemia) as a proxy to ´Celtic migrations´ in the 4th century BC.
R version 3.6.3 was used (R Core Team, 2020) to run the analysis.
Following R packages are used:
tidyverse family of packages (Wickham et al., 2019),
ggforce (Pedersen, 2019),
ggbiplot (Vu, 2011),
ggridges (Wilke, 2020),
cluster (Maechler et al., 2019),
pdist (Wong, 2013),
StatMatch (D’Orazio, 2019),
MASS (Venables and Ripley, 2002),
cowplot (Wilke, 2019),
corrplot (Wei and Simko, 2017),
knitr (Xie, 2020a), DT (Xie et al., 2020), bookdown (Xie, 2020b),
grid (R Core Team, 2020), gridExtra (Auguie, 2017).
composition <- duchcov %>%
select(Li:Bi) %>%
as.matrix()
rownames(composition) <- duchcov$id
# elements selected for PCA and further analyses
elements <- c("Co", "Ni", "Zn", "As", "Ag", "Sb", "Pb")
DT::datatable(composition[, elements], caption = "Compositions table")Only Co, Ni, Zn, As, Ag, Sb, Pb element compositions of 49 analyzed artefacts from Duchcov hoard were used for the analyses.
corrplot::corrplot(cor(composition[, elements]),
type = "upper", diag = FALSE,
order = "hclust", method = "pie")# values are scaled to mean 0 and sd 1
composition_biplot <- scale(composition[, elements])
# solve overplotting in variable names on a biplot
colnames(composition_biplot)[4] <- "\nAs"
colnames(composition_biplot)[2] <- "\nNi"
pc_comp <- prcomp(composition_biplot)
# First 5 PCs are used in subsequent analysis to retain close to 95% of explained variance
summary(pc_comp)## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.7159 1.1722 0.9935 0.9010 0.66765 0.51310 0.41717
## Proportion of Variance 0.4206 0.1963 0.1410 0.1160 0.06368 0.03761 0.02486
## Cumulative Proportion 0.4206 0.6169 0.7579 0.8739 0.93753 0.97514 1.00000
Elbow method and silhouette method are used to determine number of clusters in the dataset.
plot_scree_clusters <- function(x) {
wss <- 0
max_i <- (nrow(x) - 1)/4
for (i in 1:max_i) {
km.model <- kmeans(x, centers = i, nstart = 20)
wss[i] <- km.model$tot.withinss
}
plot(1:max_i, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
}
plot_scree_clusters(pc_comp$x[, 1:5])Figure 1: Scree plot
plot_sil_width <- function(x) {
sw <- 0
max_i <- (nrow(x) - 1)/4
for (i in 2:max_i) {
km.model <- cluster::pam(x = pc_comp$x, k = i)
sw[i] <- km.model$silinfo$avg.width
}
sw <- sw[-1]
plot(2:max_i, sw, type = "b",
xlab = "Number of Clusters",
ylab = "Average silhouette width")
}
plot_sil_width(pc_comp$x[, 1:5])Figure 2: Silhouettes
Both methods suggest optimal number of clusters in the dataset is 4 to 5, we chose to use k-means clustering with 4 clusters.
# ------------------------------------------------------------------------------
# please note: to avoid confusion in changes in k-means cluster numbers/colors
# the cluster assignment was hardcoded in a csv file 'clusters.csv'
# ------------------------------------------------------------------------------
# km_comp <- kmeans(pc_comp$x[, 1:5], centers = 4, nstart = 50)
# write_csv(tibble(kmeans = km_comp$cluster), "../data/clusters.csv")
cluster <- read_csv("../data/clusters.csv") %>%
mutate(kmeans = factor(kmeans))# ggplot(as_tibble(pc_comp$x), aes(PC1, PC2, fill = cluster$kmeans)) +
# geom_mark_hull(aes(color = cluster$kmeans), expand = unit(2.4, "mm")) +
# geom_point(aes(shape = cluster$kmeans, color = cluster$kmeans)) +
# scale_fill_brewer(palette = "Set1") + scale_color_brewer(palette = "Set1") +
# labs(fill = "k-means\ncluster", color = "k-means\ncluster", shape = "k-means\ncluster")Mean values for different element compositions in the clusters 1 to 4:
km_elements <- as_tibble(cbind(composition[, elements], cluster)) %>%
group_by(kmeans) %>%
summarize_all(mean) %>%
mutate_at(vars(-kmeans), round, 4)
knitr::kable(km_elements,
rownames = FALSE, caption = "Mean values for element compositions")| kmeans | Co | Ni | Zn | As | Ag | Sb | Pb |
|---|---|---|---|---|---|---|---|
| 1 | 184.7500 | 0.1987 | 0.0733 | 0.2133 | 0.0295 | 0.1443 | 2.3779 |
| 2 | 38.9677 | 0.0377 | 0.0548 | 0.0939 | 0.0112 | 0.0412 | 0.2274 |
| 3 | 0.0000 | 0.0263 | 1.8100 | 0.0567 | 0.0057 | 0.0370 | 0.1713 |
| 4 | 971.0000 | 0.0710 | 0.1900 | 0.5400 | 0.0303 | 0.0623 | 0.1907 |
isotopes <- duchcov %>%
select(id, starts_with("Pb2"))
isotopes <- bind_cols(isotopes, cluster)
DT::datatable(isotopes, rownames = FALSE)lab_206_207 <- labs(x = expression(paste(""^206, "Pb/", ""^204, "Pb")),
y = expression(paste(""^207, "Pb/", ""^204, "Pb")))
lab_206_208 <- labs(x = expression(paste(""^206, "Pb/", ""^204, "Pb")),
y = expression(paste(""^208, "Pb/", ""^204, "Pb")))
lab_207_208 <- labs(x = expression(paste(""^207, "Pb/", ""^206, "Pb")),
y = expression(paste(""^208, "Pb/", ""^206, "Pb")))# ggplot(isotopes, aes(Pb206_204, Pb207_204)) +
# geom_point(aes(color = kmeans, shape = kmeans), size = 1.6) +
# scale_color_brewer(palette = "Set1", name = "k-means\ncluster") +
# labs(shape = "k-means\ncluster") + lab_206_207
#
# ggplot(isotopes, aes(Pb206_204, Pb208_204)) +
# geom_point(aes(color = kmeans, shape = kmeans), size = 1.6) +
# scale_color_brewer(palette = "Set1", name = "k-means\ncluster") +
# labs(shape = "k-means\ncluster") + lab_206_208
#
# ggplot(isotopes, aes(Pb207_206, Pb208_206)) +
# geom_point(aes(color = kmeans, shape = kmeans), size = 1.6) +
# scale_color_brewer(palette = "Set1", name = "k-means\ncluster") +
# labs(shape = "k-means\ncluster") + lab_207_208Here we compare the distributions of Duchcov k-means clusters with distributions of known sources based on lead isotopes.
sources <- read_csv("../data/ore_sources.csv") %>%
mutate(Region = factor(Region,
levels = c("Valais", "Rheinland", "Saarland",
"Rammelsberg & Harz",
"E Alps", "SE Alps", "Erzgebirge",
"Slovakia", "E Carpathians")),
Region2 = factor(Region2,
levels = c("Valais", "Rheinland", "Saarland",
"Rammelsberg", "Harz",
"E Alps", "SE Alps", "Erzgebirge",
"Slovakia", "E Carpathians"))) %>%
drop_na()
# frame margins of further plots
dist_duchcov <- vector(mode = "list")
dist_duchcov$Pb206x$min <- min(isotopes$Pb206_204, na.rm = TRUE)
dist_duchcov$Pb206x$max <- max(isotopes$Pb206_204, na.rm = TRUE)
dist_duchcov$Pb207y$min <- min(isotopes$Pb207_204, na.rm = TRUE)
dist_duchcov$Pb207y$max <- max(isotopes$Pb207_204, na.rm = TRUE)
dist_duchcov$Pb208_204y$min <- min(isotopes$Pb208_204, na.rm = TRUE)
dist_duchcov$Pb208_204y$max <- max(isotopes$Pb208_204, na.rm = TRUE)
dist_duchcov$Pb208y$min <- min(isotopes$Pb208_206, na.rm = TRUE)
dist_duchcov$Pb208y$max <- max(isotopes$Pb208_206, na.rm = TRUE)
dist_duchcov$Pb207x$min <- min(isotopes$Pb207_206, na.rm = TRUE)
dist_duchcov$Pb207x$max <- max(isotopes$Pb207_206, na.rm = TRUE)Points removed as outliers are highlighted in red in the following figures. Outliers are defined as points lying further than 2 interquartile ranges from the upper quartile of the Mahalanobis distances from the data distribution center in a multidimensional space defined by isotopic signals \(^{206}Pb/^{204}Pb\), \(^{207}Pb/^{204}Pb\), \(^{208}Pb/^{204}Pb\), \(^{207}Pb/^{206}Pb\) and \(^{208}Pb/^{206}Pb\).
ggplot(sources, aes(Pb206_204, Pb207_204)) +
geom_point(aes(color = Outlier), alpha = 0.4) +
facet_wrap(~Region, scales = "free") +
scale_color_manual(values = c("forestgreen", "red")) +
lab_206_207Figure 3: Ore sources outliers
ggplot(sources, aes(Pb206_204, Pb208_204)) +
geom_point(aes(color = Outlier), alpha = 0.4) +
facet_wrap(~Region, scales = "free") +
scale_color_manual(values = c("forestgreen", "red")) +
lab_206_208Figure 4: Ore sources outliers
ggplot(sources, aes(Pb207_206, Pb208_206)) +
geom_point(aes(color = Outlier), alpha = 0.4) +
facet_wrap(~Region, scales = "free") +
scale_color_manual(values = c("forestgreen", "red")) +
lab_207_208Figure 5: Ore sources outliers
gg_206_207 <- ggplot(sources, aes(Pb206_204, Pb207_204)) +
lab_206_207
gg_206_208 <- ggplot(sources, aes(Pb206_204, Pb208_204)) +
lab_206_208
gg_207_208 <- ggplot(sources, aes(Pb207_206, Pb208_206)) +
lab_207_208Overall view of distributions of lead isotope values.
gg_206_207 +
geom_point(alpha = 0.2, aes(color = Region)) +
scale_color_viridis_d(direction = -1, end = 0.9) +
annotate("rect",
xmin = dist_duchcov$Pb206x$min, xmax = dist_duchcov$Pb206x$max,
ymin = dist_duchcov$Pb207y$min, ymax = dist_duchcov$Pb207y$max, alpha = 0.2) +
geom_point(data = isotopes, aes(Pb206_204, Pb207_204), shape = "cross")Figure 6: Ore sources overlayed with Duchcov hoard data
gg_206_208 +
geom_point(alpha = 0.2, aes(color = Region)) +
scale_color_viridis_d(direction = -1, end = 0.9) +
annotate("rect",
xmin = dist_duchcov$Pb206x$min, xmax = dist_duchcov$Pb206x$max,
ymin = dist_duchcov$Pb208_204y$min, ymax = dist_duchcov$Pb208_204y$max, alpha = 0.2) +
geom_point(data = isotopes, aes(Pb206_204, Pb208_204), shape = "cross")Figure 7: Ore sources overlayed with Duchcov hoard data
gg_207_208 +
geom_point(alpha = 0.2, aes(color = Region)) +
scale_color_viridis_d(direction = -1, end = 0.9) +
annotate("rect",
xmin = dist_duchcov$Pb207x$min, xmax = dist_duchcov$Pb207x$max,
ymin = dist_duchcov$Pb208y$min, ymax = dist_duchcov$Pb208y$max, alpha = 0.2) +
geom_point(data = isotopes, aes(Pb207_206, Pb208_206), shape = "cross")Figure 8: Ore sources overlayed with Duchcov hoard data
gg_206_207 <- gg_206_207 +
coord_cartesian(xlim = c(dist_duchcov$Pb206x$min, dist_duchcov$Pb206x$max),
ylim = c(dist_duchcov$Pb207y$min, dist_duchcov$Pb207y$max)) +
guides(shape = guide_legend(ncol=2), color = guide_legend(ncol = 2)) +
labs(shape = "K-means cluster")
gg_206_208 <- gg_206_208 +
coord_cartesian(xlim = c(dist_duchcov$Pb206x$min, dist_duchcov$Pb206x$max),
ylim = c(dist_duchcov$Pb208_204y$min, dist_duchcov$Pb208_204y$max)) +
guides(shape = guide_legend(ncol=2), color = guide_legend(ncol = 2)) +
labs(shape = "K-means cluster")
gg_207_208 <- gg_207_208 +
coord_cartesian(xlim = c(dist_duchcov$Pb207x$min, dist_duchcov$Pb207x$max),
ylim = c(dist_duchcov$Pb208y$min, dist_duchcov$Pb208y$max)) +
guides(shape = guide_legend(ncol=2), color = guide_legend(ncol = 2)) +
labs(shape = "K-means cluster")Several visual methods are explored in order to determine whether the lead isotopic signal from Duchcov hoard objects grouped by k-means clustering correspond to known sources. At first, areas corresponding to lead isotope signals from different geographic regions are constructed in isospaces using concave hulls (Gombin et al., 2017, concaveman algorithm) of the point distributions for given geographic regions.
# gg_206_207 +
# geom_point(alpha = 0.2, aes(color = Region2), show.legend = FALSE) +
# geom_mark_hull(aes(fill = Region2, color = Region2), expand = unit(2.4, "mm"),
# alpha = 0.1, show.legend = FALSE) +
# scale_color_viridis_d(direction = -1, end = 0.9) + scale_fill_viridis_d(direction = -1) +
# geom_point(data = isotopes, aes(shape = kmeans), alpha = 0.6) +
# facet_wrap(~Region)
#
# gg_206_208 +
# geom_point(alpha = 0.2, aes(color = Region2), show.legend = FALSE) +
# geom_mark_hull(aes(fill = Region2, color = Region2), expand = unit(2.4, "mm"),
# alpha = 0.1, show.legend = FALSE) +
# scale_color_viridis_d(direction = -1, end = 0.9) + scale_fill_viridis_d(direction = -1) +
# geom_point(data = isotopes, aes(shape = kmeans), alpha = 0.6) +
# facet_wrap(~Region)
#
# gg_207_208 +
# geom_point(alpha = 0.2, aes(color = Region2), show.legend = FALSE) +
# geom_mark_hull(aes(fill = Region2, color = Region2), expand = unit(2.4, "mm"),
# alpha = 0.1, show.legend = FALSE) +
# scale_color_viridis_d(direction = -1, end = 0.9) + scale_fill_viridis_d(direction = -1) +
# geom_point(data = isotopes, aes(shape = kmeans), alpha = 0.6) +
# facet_wrap(~Region)Kernel density estimation for each of the source regions helps in identifying to
what extent the point distribution of Duchcov hoard objects fits the density
distribution in given regions. kde2d algorithm of MASS package is used (Venables and Ripley, 2002).
gg_206_207 +
stat_density2d(aes(fill = stat(nlevel)), geom = "polygon", alpha = 0.2, show.legend = FALSE) +
scale_fill_gradient(low = "white", high = "black") +
geom_point(alpha = 0.2, size = 0.2, color = "black") +
geom_point(data = isotopes, aes(shape = kmeans, color = kmeans)) +
scale_color_brewer(palette = "Set1", name = "K-means cluster") +
facet_wrap(~ Region)Figure 9: Kernel density estimation of ore sources
gg_206_208 +
stat_density2d(aes(fill = stat(nlevel)), geom = "polygon", alpha = 0.2, show.legend = FALSE) +
scale_fill_gradient(low = "white", high = "black") +
geom_point(alpha = 0.2, size = 0.2, color = "black") +
geom_point(data = isotopes, aes(shape = kmeans, color = kmeans)) +
scale_color_brewer(palette = "Set1", name = "K-means cluster") +
facet_wrap(~ Region)Figure 10: Kernel density estimation of ore sources
gg_207_208 +
stat_density2d(aes(fill = stat(nlevel)), geom = "polygon", alpha = 0.2, show.legend = FALSE) +
scale_fill_gradient(low = "white", high = "black") +
geom_point(alpha = 0.2, size = 0.2, color = "black") +
geom_point(data = isotopes, aes(shape = kmeans, color = kmeans)) +
scale_color_brewer(palette = "Set1", name = "K-means cluster") +
facet_wrap(~ Region)Figure 11: Kernel density estimation of ore sources
Euclidean distance between all the points in given geographic region and k-means clusters from Duchcov hoard is used as one of the proxies to identify possible sources of the lead isotope ratios \(^{208}Pb/^{204}Pb\), \(^{207}Pb/^{204}Pb\), \(^{206}Pb/^{204}Pb\), \(^{208}Pb/^{206}Pb\) and \(^{207}Pb/^{206}Pb\).
The problem with using Euclidean distances is that the multivariate shape of the data distribution is not taken into account, thus Mahalanobis distance (see further) seems as a better fit for the task, even though Euclidean distances are commonly used (e.g. Ling et al., 2014).
nest_matrix <- function(df, gr) {
output <- df %>% group_by(!!sym(gr)) %>%
nest() %>%
mutate(mx = map(data, as.matrix))
return(output)
}
point_cloud_distance <- function(origin, goal, group_origin, group_goal, method) {
# create nested matrices
x <- nest_matrix(origin, group_origin)
y <- nest_matrix(goal, group_goal)
lx <- nrow(x)
ly <- nrow(y)
# create empty list for results
distance <- vector(mode = "list", length = ly)
for (k in seq_along(distance)) {
distance[[k]] <- vector(mode = "list", length = lx)
names(distance[[k]]) <- x %>% pull(!!sym(group_origin))
}
names(distance) <- y %>% pull(!!sym(group_goal)) %>% str_c("K-means cluster ", .)
# count euclidean distances
if (method == "euclidean") {
for (i in 1:ly) {
for (j in 1:lx) {
distance[[i]][[j]] <- pdist::pdist(X = y$mx[[i]], Y = x$mx[[j]])@dist
}
distance[[i]] <- bind_rows(lapply(distance[[i]], as_tibble), .id = "region")
}
# mahalanobis distance
} else if (method == "mahalanobis") {
for (i in 1:ly) {
for (j in 1:lx) {
distance[[i]][[j]] <- StatMatch::mahalanobis.dist(data.x = y$mx[[i]],
data.y = x$mx[[j]])
}
distance[[i]] <- bind_rows(
lapply(lapply(distance[[i]], as.vector), as_tibble), .id = "region")
}
}
distance <- bind_rows(distance, .id = "kmeans")
return(distance)
}src <- sources %>% select(Region2, starts_with("Pb"))
iso <- isotopes %>% select(kmeans, starts_with("Pb")) %>% na.omit()
euclidean_distance <- point_cloud_distance(origin = src, goal = iso,
group_origin = "Region2", group_goal = "kmeans",
method = "euclidean") %>%
mutate(region = fct_relevel(region, levels(sources$Region2)))# ggplot(euclidean_distance, aes(y = region, x = value, fill = region)) +
# ggridges::geom_density_ridges(alpha = 0.4, show.legend = FALSE, scale = 1.1) +
# scale_fill_viridis_d(direction = -1, end = 0.9) +
# coord_cartesian(xlim = c(0, 1)) +
# facet_wrap(~kmeans, scales = "fixed") +
# labs(y = "Region",
# x = "Euclidean distance to a given Duchcov K-means cluster (x axis limited to interval 0 - 1)")Mahalanobis distance is used as a metric to compare points fit in a given source distribution. The scale differences and effects of correlation between the variables are removed in case of Mahalanobis distance.
mahalanobis_distance <- point_cloud_distance(origin = src, goal = iso,
group_origin = "Region2", group_goal = "kmeans",
method = "mahalanobis") %>%
mutate(region = fct_relevel(region, levels(sources$Region2)))# ggplot(mahalanobis_distance, aes(y = region, x = value, fill = region)) +
# ggridges::geom_density_ridges(alpha = 0.4, show.legend = FALSE, scale = 1.1) +
# scale_fill_viridis_d(direction = -1, end = 0.9) +
# coord_cartesian(xlim = c(0, 10)) +
# facet_wrap(~kmeans, scales = "fixed") +
# labs(y = "Region",
# x = "Mahalanobis distance to a given Duchcov K-means cluster (x axis limited to interval 0 - 10)")Manova is used in order to determine whether there are significant differences between group means for the distinct regions. Most of the group means are significantly different in most of the dimensions defined by different lead isotopes. Region means that are not sig. different are Slovakia, Easter Carpathians and South-Eastern Alpes (see results below). Pillai test on MANOVA is significant, thus the region means combined for all the variables (lead isotopic values) are significantly different.
med_d_all <- mahalanobis_distance %>%
dplyr::summarize(median = median(value, na.rm = TRUE)) %>%
pull(median)
med_d_regions <- mahalanobis_distance %>% group_by(region) %>%
dplyr::summarize(med_d = median(value, na.rm = TRUE))
med_d_regions %>%
ggplot(aes(x = med_d, y = fct_reorder(region, med_d))) +
geom_point() +
geom_vline(xintercept = med_d_all) +
labs(x = "Median MD", y = "Region")Only regions with median Mahalanobis distance smaller than 5 are included in LDA with exception of Slovakia. Slovakia has leser Mahalanobis distance, but is removed from the test, because according to MANOVA, it is the only one group that is not significantly different.
sources_subset <- sources %>% filter(Region %in% filtered_regions$region) %>%
mutate(Region = fct_drop(Region))
sources_lm <- lm(as.matrix(sources_subset[, c("Pb207_206", "Pb208_206",
"Pb206_204", "Pb207_204", "Pb208_204")]) ~ sources_subset$Region)
summary(sources_lm)## Response Pb207_206 :
##
## Call:
## lm(formula = Pb207_206 ~ sources_subset$Region)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.083573 -0.004612 -0.000499 0.006032 0.051416
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.838894 0.001731 484.675 < 2e-16 ***
## sources_subset$RegionSaarland 0.012405 0.002171 5.715 1.84e-08 ***
## sources_subset$RegionE Alps -0.027991 0.002152 -13.005 < 2e-16 ***
## sources_subset$RegionSE Alps 0.021184 0.002178 9.724 < 2e-16 ***
## sources_subset$RegionErzgebirge 0.013404 0.002222 6.032 3.04e-09 ***
## sources_subset$RegionE Carpathians -0.005909 0.002320 -2.547 0.0112 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01329 on 530 degrees of freedom
## Multiple R-squared: 0.6338, Adjusted R-squared: 0.6304
## F-statistic: 183.5 on 5 and 530 DF, p-value: < 2.2e-16
##
##
## Response Pb208_206 :
##
## Call:
## lm(formula = Pb208_206 ~ sources_subset$Region)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.156006 -0.007665 0.000987 0.010358 0.099886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.070344 0.003227 641.553 < 2e-16 ***
## sources_subset$RegionSaarland 0.018247 0.004047 4.509 8.04e-06 ***
## sources_subset$RegionE Alps -0.031838 0.004013 -7.934 1.27e-14 ***
## sources_subset$RegionSE Alps 0.038589 0.004062 9.501 < 2e-16 ***
## sources_subset$RegionErzgebirge 0.022366 0.004143 5.398 1.02e-07 ***
## sources_subset$RegionE Carpathians -0.008377 0.004326 -1.936 0.0534 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02479 on 530 degrees of freedom
## Multiple R-squared: 0.4945, Adjusted R-squared: 0.4898
## F-statistic: 103.7 on 5 and 530 DF, p-value: < 2.2e-16
##
##
## Response Pb206_204 :
##
## Call:
## lm(formula = Pb206_204 ~ sources_subset$Region)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2068 -0.1429 0.0097 0.1046 2.3620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.68180 0.04445 420.281 < 2e-16 ***
## sources_subset$RegionSaarland -0.33292 0.05575 -5.972 4.30e-09 ***
## sources_subset$RegionE Alps 0.72722 0.05527 13.157 < 2e-16 ***
## sources_subset$RegionSE Alps -0.46861 0.05595 -8.376 4.93e-16 ***
## sources_subset$RegionErzgebirge -0.37794 0.05707 -6.622 8.68e-11 ***
## sources_subset$RegionE Carpathians 0.11216 0.05959 1.882 0.0604 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3414 on 530 degrees of freedom
## Multiple R-squared: 0.6252, Adjusted R-squared: 0.6216
## F-statistic: 176.8 on 5 and 530 DF, p-value: < 2.2e-16
##
##
## Response Pb207_204 :
##
## Call:
## lm(formula = Pb207_204 ~ sources_subset$Region)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.130548 -0.012688 0.000218 0.011287 0.116907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.667000 0.003148 4976.260 <2e-16 ***
## sources_subset$RegionSaarland -0.046937 0.003948 -11.888 <2e-16 ***
## sources_subset$RegionE Alps 0.055741 0.003915 14.238 <2e-16 ***
## sources_subset$RegionSE Alps -0.004218 0.003963 -1.064 0.2876
## sources_subset$RegionErzgebirge -0.067402 0.004042 -16.675 <2e-16 ***
## sources_subset$RegionE Carpathians -0.012541 0.004221 -2.971 0.0031 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02418 on 530 degrees of freedom
## Multiple R-squared: 0.747, Adjusted R-squared: 0.7446
## F-statistic: 313 on 5 and 530 DF, p-value: < 2.2e-16
##
##
## Response Pb208_204 :
##
## Call:
## lm(formula = Pb208_204 ~ sources_subset$Region)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1535 -0.1453 0.0301 0.1258 1.8033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.66768 0.04576 845.053 < 2e-16 ***
## sources_subset$RegionSaarland -0.34485 0.05739 -6.009 3.47e-09 ***
## sources_subset$RegionE Alps 0.87200 0.05690 15.325 < 2e-16 ***
## sources_subset$RegionSE Alps -0.26015 0.05759 -4.517 7.73e-06 ***
## sources_subset$RegionErzgebirge -0.36390 0.05875 -6.194 1.18e-09 ***
## sources_subset$RegionE Carpathians 0.08348 0.06134 1.361 0.174
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3515 on 530 degrees of freedom
## Multiple R-squared: 0.6346, Adjusted R-squared: 0.6312
## F-statistic: 184.1 on 5 and 530 DF, p-value: < 2.2e-16
## Df Pillai approx F num Df den Df Pr(>F)
## sources_subset$Region 5 1.6176 50.692 25 2650 < 2.2e-16 ***
## Residuals 530
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# summary(manova(sources_lm), test = "Wilks")
# summary(manova(sources_lm), test = "Roy")
# summary(manova(sources_lm), test = "Hotelling-Lawley")Linear discriminant analysis is then used to separate the regions.
n_levels_sources <- length(levels(sources_subset$Region))
sources_lda <- MASS::lda(Region ~ ., sources_subset[, c("Region",
"Pb207_206", "Pb208_206",
"Pb206_204", "Pb207_204", "Pb208_204")],
prior = as.double(
paste(
rep(1/n_levels_sources,
n_levels_sources), sep = ",")))
sources_lda## Call:
## lda(Region ~ ., data = sources_subset[, c("Region", "Pb207_206",
## "Pb208_206", "Pb206_204", "Pb207_204", "Pb208_204")], prior = as.double(paste(rep(1/n_levels_sources,
## n_levels_sources), sep = ",")))
##
## Prior probabilities of groups:
## Valais Saarland E Alps SE Alps Erzgebirge
## 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
## E Carpathians
## 0.1666667
##
## Group means:
## Pb207_206 Pb208_206 Pb206_204 Pb207_204 Pb208_204
## Valais 0.8388941 2.070344 18.68180 15.66700 38.66768
## Saarland 0.8512989 2.088591 18.34888 15.62006 38.32283
## E Alps 0.8109029 2.038506 19.40902 15.72274 39.53968
## SE Alps 0.8600777 2.108933 18.21319 15.66278 38.40752
## Erzgebirge 0.8522983 2.092711 18.30386 15.59960 38.30378
## E Carpathians 0.8329849 2.061967 18.79396 15.65446 38.75116
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4 LD5
## Pb207_206 42.302498 1133.49216 -1647.7969713 5038.1390 -1032.98177
## Pb208_206 73.962760 -315.07897 642.9876399 -3150.1984 921.79432
## Pb206_204 11.030219 10.84895 -0.8771077 -118.8825 57.04978
## Pb207_204 -56.005454 -17.89743 102.1053150 -245.4064 54.79130
## Pb208_204 -3.933951 15.60876 -36.4779825 161.5616 -50.66278
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5
## 0.5872 0.3696 0.0343 0.0085 0.0004
The LDA works, but the there are problems in classification of points, meaning it is not accurate. (LDA is limited to the data distributions of regions that are the most proximal to the points of interest according to Mahalanobis distance). Using the LDA to predict the regions, we get a lot of error, see the confusion matrix below. How to read it: rows are the regions the point is from and columns are regions predicted based on the LDA.
sources_pred <- predict(sources_lda)
prediction <- with(sources_pred, data.frame(Region = sources_subset$Region,
Predict = class, round(posterior, 2)))
knitr::kable(addmargins(xtabs(~fct_drop(Region) + fct_drop(Predict), prediction)),
caption = "Confusion matrix")| Valais | Saarland | E Alps | SE Alps | Erzgebirge | E Carpathians | Sum | |
|---|---|---|---|---|---|---|---|
| Valais | 37 | 3 | 3 | 3 | 1 | 12 | 59 |
| Saarland | 3 | 82 | 0 | 2 | 15 | 1 | 103 |
| E Alps | 21 | 0 | 81 | 2 | 0 | 4 | 108 |
| SE Alps | 9 | 1 | 0 | 91 | 0 | 0 | 101 |
| Erzgebirge | 0 | 16 | 0 | 1 | 72 | 2 | 91 |
| E Carpathians | 20 | 1 | 0 | 0 | 6 | 47 | 74 |
| Sum | 90 | 103 | 84 | 99 | 94 | 66 | 536 |
# overall success of the model in predicting correct region
round(mean(prediction$Region == prediction$Predict), 2)## [1] 0.76
# individual correct predictions (diagonal)
round(prop.table(table(droplevels(prediction$Region),
droplevels(prediction$Predict)),
margin = 1), 2)##
## Valais Saarland E Alps SE Alps Erzgebirge E Carpathians
## Valais 0.63 0.05 0.05 0.05 0.02 0.20
## Saarland 0.03 0.80 0.00 0.02 0.15 0.01
## E Alps 0.19 0.00 0.75 0.02 0.00 0.04
## SE Alps 0.09 0.01 0.00 0.90 0.00 0.00
## Erzgebirge 0.00 0.18 0.00 0.01 0.79 0.02
## E Carpathians 0.27 0.01 0.00 0.00 0.08 0.64
Prediction of Regions based on data on points from Duchcov hoard.
duchcov_prediction <- predict(object = sources_lda, newdata = isotopes[, -7])
duchcov_prediction_df <- with(duchcov_prediction,
data.frame(id = isotopes[, "id"],
kmeans = isotopes[, "kmeans"],
predicted_region = class,
round(posterior, 2)))
# posterior probabilities of individual predictions
DT::datatable(duchcov_prediction_df,
caption = "Posterior probabilities of predicted sources")duchcov_prediction_df %>%
select(-predicted_region) %>%
gather(key = "region", value = "probability", -id, -kmeans) %>%
group_by(kmeans, region) %>%
dplyr::summarize(mean_prob = mean(probability, na.rm = TRUE)) %>%
ggplot(aes(x = region, y = mean_prob)) +
geom_col(fill = "white", col = "black") +
facet_wrap(~kmeans, ncol = 1) +
labs(x = "Region", y = "Mean posterior probability")# matrix of counts of points in individual k-means clusters (rows)
# fitting to various regions (columns)
knitr::kable(addmargins(xtabs(~kmeans + fct_drop(predicted_region),
duchcov_prediction_df)),
caption = "Counts of points per different sources")| Valais | Saarland | SE Alps | Erzgebirge | E Carpathians | Sum | |
|---|---|---|---|---|---|---|
| 1 | 8 | 0 | 1 | 0 | 2 | 11 |
| 2 | 7 | 13 | 0 | 8 | 3 | 31 |
| 3 | 0 | 1 | 0 | 2 | 0 | 3 |
| 4 | 0 | 1 | 0 | 1 | 1 | 3 |
| Sum | 15 | 15 | 1 | 11 | 6 | 48 |
ggplot(isotopes, aes(Pb206_204, Pb207_204, label = kmeans)) +
geom_text(aes(color = duchcov_prediction_df$predicted_region,
shape = duchcov_prediction_df$predicted_region)) +
scale_color_brewer(palette = "Set1", name = "predicted\nregion") +
labs(shape = "predicted\nregion") + lab_206_207Figure 12: Source regions as assigned to Duchcov k-means clusters (numeric labels) by LDA
ggplot(isotopes, aes(Pb206_204, Pb208_204, label = kmeans)) +
geom_text(aes(color = duchcov_prediction_df$predicted_region,
shape = duchcov_prediction_df$predicted_region)) +
scale_color_brewer(palette = "Set1", name = "predicted\nregion") +
labs(shape = "predicted\nregion") + lab_206_208Figure 13: Source regions as assigned to Duchcov k-means clusters (numeric labels) by LDA
ggplot(isotopes, aes(Pb207_206, Pb208_206, label = kmeans)) +
geom_text(aes(color = duchcov_prediction_df$predicted_region,
shape = duchcov_prediction_df$predicted_region)) +
scale_color_brewer(palette = "Set1", name = "predicted\nregion") +
labs(shape = "predicted\nregion") + lab_207_208Figure 14: Source regions as assigned to Duchcov k-means clusters (numeric labels) by LDA
duchcov <- duchcov %>% transmute(brooch = if_else(str_detect(type_eng, "^brooch_"),
"brooch", "not brooch"),
brooch = factor(brooch, levels = c("brooch", "not brooch")),
id = id) %>%
left_join(isotopes)
burials <- burials %>% mutate(brooch = if_else(str_detect(type_eng, "brooch_"),
"brooch", "not brooch"),
brooch = factor(brooch, levels = c("brooch", "not brooch"))) %>%
select(-type_eng)g_bur_206_207 <- ggplot(burials, aes(Pb206_204, Pb207_204)) +
labs(color = "Burials", shape = "Duchcov k-means\ncluster") +
lab_206_207 +
theme(legend.position = c(0.84, 0.2)) +
guides(shape = guide_legend(ncol=2), color = guide_legend(ncol = 2))
g_bur_206_208 <- ggplot(burials, aes(Pb206_204, Pb208_204)) +
labs(color = "Burials", shape = "Duchcov k-means\ncluster") +
lab_206_208 +
theme(legend.position = c(0.84, 0.2)) +
guides(shape = guide_legend(ncol=2), color = guide_legend(ncol = 2))
g_bur_207_208 <- ggplot(burials, aes(Pb207_206, Pb208_206)) +
labs(color = "Burials", shape = "Duchcov k-means\ncluster") +
lab_207_208 +
theme(legend.position = c(0.84, 0.2)) +
guides(shape = guide_legend(ncol=2), color = guide_legend(ncol = 2))g_bur_206_207 + geom_point(aes(color = site), alpha = 0.8, show.legend = FALSE) +
geom_mark_hull(aes(color = site, fill = site), expand = unit(2.4, "mm"),
alpha = 0.1, show.legend = FALSE) +
scale_color_brewer(palette = "Set1") + scale_fill_brewer(palette = "Set1") +
geom_point(data = duchcov, aes(shape = kmeans), size = 2, alpha = 0.4) +
facet_wrap(~site)Figure 15: Distribution of data from different burials (LTB1 horizon) overlayed with Duchcov hoard
g_bur_206_208 + geom_point(aes(color = site), alpha = 0.8, show.legend = FALSE) +
geom_mark_hull(aes(color = site, fill = site), expand = unit(2.4, "mm"),
alpha = 0.1, show.legend = FALSE) +
scale_color_brewer(palette = "Set1") + scale_fill_brewer(palette = "Set1") +
geom_point(data = duchcov, aes(shape = kmeans), size = 2, alpha = 0.4) +
facet_wrap(~site)Figure 16: Distribution of data from different burials (LTB1 horizon) overlayed with Duchcov hoard
g_bur_207_208 + geom_point(aes(color = site), alpha = 0.8, show.legend = FALSE) +
geom_mark_hull(aes(color = site, fill = site), expand = unit(2.4, "mm"),
alpha = 0.1, show.legend = FALSE) +
scale_color_brewer(palette = "Set1") + scale_fill_brewer(palette = "Set1") +
geom_point(data = duchcov, aes(shape = kmeans), size = 2, alpha = 0.4) +
facet_wrap(~site)Figure 17: Distribution of data from different burials (LTB1 horizon) overlayed with Duchcov hoard
# g_bur_206_207 +
# stat_density2d(aes(fill = stat(nlevel)), geom = "polygon", alpha = 0.2, show.legend = FALSE) +
# scale_fill_gradient(low = "white", high = "black") +
# scale_color_brewer(palette = "Set1") +
# geom_point(aes(color = site), alpha = 0.8, show.legend = FALSE) +
# geom_point(data = duchcov, aes(shape = kmeans), size = 1.4, alpha = 0.4) +
# facet_wrap(~site)
#
# g_bur_206_208 +
# stat_density2d(aes(fill = stat(nlevel)), geom = "polygon", alpha = 0.2, show.legend = FALSE) +
# scale_fill_gradient(low = "white", high = "black") +
# scale_color_brewer(palette = "Set1") +
# geom_point(aes(color = site), alpha = 0.8, show.legend = FALSE) +
# geom_point(data = duchcov, aes(shape = kmeans), size = 1.4, alpha = 0.4) +
# facet_wrap(~site)
#
# g_bur_207_208 +
# stat_density2d(aes(fill = stat(nlevel)), geom = "polygon", alpha = 0.2, show.legend = FALSE) +
# scale_fill_gradient(low = "white", high = "black") +
# scale_color_brewer(palette = "Set1") +
# geom_point(aes(color = site), alpha = 0.8, show.legend = FALSE) +
# geom_point(data = duchcov, aes(shape = kmeans), size = 1.4, alpha = 0.4) +
# facet_wrap(~site)g_bro_206_207 <- ggplot(origin, aes(Pb206_204, Pb207_204, color = origin)) +
facet_wrap(~ brooch) +
labs(color = "Origin") +
lab_206_207
g_bro_206_208 <- ggplot(origin, aes(Pb206_204, Pb208_204, color = origin)) +
labs(color = "Origin") +
facet_wrap(~ brooch) +
lab_206_208
g_bro_207_208 <- ggplot(origin, aes(Pb207_206, Pb208_206, color = origin)) +
labs(color = "Origin") +
facet_wrap(~ brooch) +
lab_207_208Figure 18: Distribution of brooches and other artefact types in burials and Duchcov hoard
Figure 19: Distribution of brooches and other artefact types in burials and Duchcov hoard
Figure 20: Distribution of brooches and other artefact types in burials and Duchcov hoard
Auguie, B., 2017. GridExtra: Miscellaneous functions for "grid" graphics.
D’Orazio, M., 2019. StatMatch: Statistical matching or data fusion.
Gombin, J., Vaidyanathan, R., Agafonkin, V., 2017. Concaveman: A very fast 2D concave hull algorithm.
Ling, J., Stos-Gale, Z., Grandin, L., Billström, K., Hjärthner-Holdar, E., Persson, P.-O., 2014. Moving metals ii: Provenancing scandinavian bronze age artefacts by lead isotope and elemental analyses. Journal of Archaeological Science 41, 106–132. https://doi.org/https://doi.org/10.1016/j.jas.2013.07.018
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K., 2019. Cluster: Cluster analysis basics and extensions.
Pedersen, T.L., 2019. Ggforce: Accelerating ’ggplot2’.
R Core Team, 2020. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Venables, W.N., Ripley, B.D., 2002. Modern applied statistics with s, Fourth. ed. Springer, New York.
Vu, V.Q., 2011. Ggbiplot: A ggplot2 based biplot.
Wei, T., Simko, V., 2017. R package "corrplot": Visualization of a correlation matrix.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T.L., Miller, E., Bache, S.M., Müller, K., Ooms, J., Robinson, D., Seidel, D.P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., Yutani, H., 2019. Welcome to the tidyverse. Journal of Open Source Software 4, 1686. https://doi.org/10.21105/joss.01686
Wilke, C.O., 2020. Ggridges: Ridgeline plots in ’ggplot2’.
Wilke, C.O., 2019. Cowplot: Streamlined plot theme and plot annotations for ’ggplot2’.
Wong, J., 2013. Pdist: Partitioned distance function.
Xie, Y., 2020a. Knitr: A general-purpose package for dynamic report generation in r.
Xie, Y., 2020b. Bookdown: Authoring books and technical documents with r markdown.
Xie, Y., Cheng, J., Tan, X., 2020. DT: A wrapper of the javascript library ’datatables’.